Spatiotemporal data partitioning for distributed random forest algorithm: Air quality prediction using imbalanced big spatiotemporal data on spark distributed framework
Abstract
Spatiotemporal air quality datasets are typically collected hourly at monitoring stations deployed non-uniformly across a metropolitan city. These datasets are not only big, which poses challenges for the storage and processing capacity of centralized computing systems, but also imbalanced and spatially heterogeneous, which may result in biased prediction. To address these challenges, we designed and developed a parallel prediction system equipped with a spatiotemporal data partitioning method, a distributed machine learning algorithm, Hadoop's platform and its resource scheduler/manager, and Spark's efficient in-memory execution environment, which is suitable for running iterative algorithms, e.g., machine learning. Our proposed partitioning method accounts for the imbalance and spatial heterogeneity of big spatiotemporal data in predictive models while complying with the load-balancing requirement of distributed systems. The Distributed Random Forest algorithm in the H2O library on the Spark framework was selected to develop the prediction model. This is an ensemble random forest algorithm with algorithm-level adjustments to perform as efficiently as possible on imbalanced datasets. An application to Tehran, Iran showed that the system had a considerable speedup gain and improved both overall accuracy and class precision when working with imbalanced data. A future research direction is to add streaming and visualization functions to provide rapid and reliable information supporting environmental and health management.
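The abstract does not spell out the partitioning procedure, so the following is only a minimal sketch of the general idea it describes: splitting station-hour records into partitions that are both load-balanced and representative of every station and class. It is plain Python, not Spark or H2O code, and it is not the authors' method. The `Record` fields, the `stratified_partition` function, and the toy data are all hypothetical names introduced for illustration.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    station: str  # monitoring station id (hypothetical field)
    hour: int     # hour index of the observation (hypothetical field)
    label: str    # air quality class (hypothetical field)


def stratified_partition(records, n_partitions):
    """Deal records round-robin within each (station, label) stratum.

    Assumption: a stratum is one station combined with one class label.
    The cursor is carried over from one stratum to the next, so partition
    sizes differ by at most one record, and each stratum is spread as
    evenly as possible over the partitions.
    """
    strata = defaultdict(list)
    for rec in records:
        strata[(rec.station, rec.label)].append(rec)

    partitions = [[] for _ in range(n_partitions)]
    cursor = 0
    for key in sorted(strata):
        # Sorting by hour spreads each stratum over time as well.
        for rec in sorted(strata[key], key=lambda r: r.hour):
            partitions[cursor % n_partitions].append(rec)
            cursor += 1
    return partitions


# Toy, made-up data: station "A" reports far more often than "B",
# and the "unhealthy" class is rare at both stations.
records = [
    Record("A", h, "unhealthy" if h % 10 == 0 else "good") for h in range(100)
] + [
    Record("B", h, "unhealthy" if h % 6 == 0 else "good") for h in range(24)
]

parts = stratified_partition(records, 4)
for i, part in enumerate(parts):
    print(i, len(part), dict(Counter(r.label for r in part)))
```

In a distributed setting, the partition index computed this way could serve as a custom partitioning key, so that each worker trains on a subset with a class and station mix close to that of the full dataset; whether this matches the system described in the paper is an assumption.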
Similar resources
Replication Strategy for Spatiotemporal Data Based on Distributed Caching System
The replica strategy in distributed cache can effectively reduce user access delay and improve system performance. However, developing a replica strategy suitable for varied application scenarios is still quite challenging, owing to differences in user access behavior and preferences. In this paper, a replication strategy for spatiotemporal data (RSSD) based on a distributed caching system is p...
Distributed Structured Prediction for Big Data
The biggest limitations of learning structured predictors from big data are the computation time and the memory demands. In this paper, we propose to handle those big data problems efficiently by distributing and parallelizing the resource requirements. We present a distributed structured prediction learning algorithm for large scale models that cannot be effectively handled by a single cluster...
Using Random Forest to Learn Imbalanced Data
In this paper we propose two ways to deal with the imbalanced data classification problem using random forest. One is based on cost sensitive learning, and the other is based on a sampling technique. Performance metrics such as precision and recall, false positive rate and false negative rate, F-measure and weighted accuracy are computed. Both methods are shown to improve the prediction accurac...
On the use of MapReduce for imbalanced big data using Random Forest
In this age, big data applications are increasingly becoming the main focus of attention because of the enormous increment of data generation and storage that has taken place in the last years. This situation becomes a challenge when huge amounts of data are processed to extract knowledge because the data mining techniques are not adapted to the new space and time requirements. Furthermore, rea...
Random forest algorithm in big data environment
The random forest method is one of the most widely applied classification algorithms at present. Starting from actual big data scenarios and requirements, this work studies in depth the application of the random forest method in the big data environment. Because big data requires processing a huge number of features at the same time, and the data pattern changes constantly over time, the accuracy of a random for...
Journal
Journal title: Environmental Technology and Innovation
Year: 2022
ISSN: 2352-1864
DOI: https://doi.org/10.1016/j.eti.2022.102776